ABSTRACT

In considering the problem of measuring achievement 
for the evaluation of school effectiveness, there are at least three 
questions that need to be answered: (1) What is to be measured? (2) 
How is it to be measured? (3) How are the results to be analyzed? 
Following a discussion related to the first two
questions — determining content objectives and selecting or 
constructing tests that match the school's curriculum — attention is 
focused on the problems of translating test results into measures of 
school effectiveness. Primary consideration is given to what kinds of 
test scores should be used for analysis. The following types of 
scores are discussed: (1) global scores from survey tests, including 
the use of different forms of the same test; (2) average scores on a 
norm-referenced test or passing rates on a criterion-referenced 
test — including ranking in terms of status scores or trends in means 
for a grade, use of an SES indicator to adjust scores, and use of 
regression analysis to adjust for bias in mean gain scores; and (3) 
pretest/posttest scores, including three approaches for going beyond 
discussions of school means. The author concludes that comparisons of
observed posttest results to those predicted from a regression of
posttest on pretest scores seem the soundest approach to using
achievement data as indices of school effectiveness. (LC)

Frechtling (1983) has identified a number of measurement issues and
dilemmas in the evaluation of school effectiveness. She has divided these
into two general areas: the measurement of achievement and the
measurement of school processes and climate. The focus of this paper is on the
issues in just one of these areas: the measurement of achievement.
Although these comments will not provide anything like definitive answers
to the tough problems that Frechtling has identified, they will highlight 
advantages and disadvantages of particular approaches, and show why 
some approaches are preferable to other commonly used ones. 

In considering the problem of measuring achievement for the evaluation 
of school effectiveness, there are at least three questions that need to be 
answered: What is to be measured? How is it to be measured? and How 
are the results to be analyzed? Although the third question sounds more 
like a statistical question than a measurement question, it is an essential 
component of the validity of any inferences about school effectiveness
based upon student achievement data. Indeed, the flaws that Frechtling
mentioned in regard to the four practices of using average scores, average
gains, passing rates, or differences in passing rates are largely the
consequence of analytic shortcomings for the desired inference from the data.

What and How to Measure 

The questions of what and how to measure are obviously of central
importance. A widely circulated ETS pamphlet entitled Selecting an
Achievement Test: Principles and Procedures (ETS, 1969) astutely notes
that "Before deciding what we want to test . . . , we must have a clear 
identification of what we want to teach" (p. 14). What are the curric- 
ular objectives and in terms of which of those objectives should school 
effectiveness be evaluated? After these questions have been answered, the
process of constructing or selecting appropriate tests can begin. Too
frequently, this first step of deciding what should be measured is given
too little attention. Indeed, the process may even be reversed; that is, a
test may already be in place and its degree of match to the curriculum
judged after the fact.

Rowan, Bossert and Dwyer (1983, p. 25) have noted that "Past
research has defined school effectiveness narrowly as instructional
effectiveness and has measured this construct using standardized achievement
tests." They go on to argue that this narrow approach ignores many
important goals. This is quite true; however, instructional goals are
clearly important and, as Rowan et al. also note, there are substantial
difficulties in adequately measuring even this aspect of effectiveness.

A difficult issue in defining the knowledge and skills to be tested is
the question of whether measurement should be limited to a core that is
common to all schools to be studied or include relatively unique objectives
that are pursued by only a few schools. Limitation to a common core may
conceal some of the most important differences between schools. On the 
other hand, including items that measure objectives unique to a few 
schools may greatly increase the testing burden and be considered unfair 
to schools that do not pursue those objectives. 

To the extent that it is feasible, it is important to go beyond the 
common core and provide some coverage of content that is emphasized by 
some but not other schools. The two categories of items must be treated 
separately in the analysis, and doing so may complicate conclusions.
Consider, for example, two hypothetical schools: the curriculum at school A
includes objectives in computer literacy while the school B curriculum does 
not. Skills in basic arithmetic operations, on the other hand, are part of
the curriculum at both schools. Suppose the analysis of the achievement
results indicates that school B is more effective than school A in terms of
arithmetic operations, but school A is more effective than B in terms of
the computer literacy measure. There is no simple answer to the question
of which school is more effective. That judgment depends on the value
attached to the two areas of achievement, and that judgment will surely vary
from one individual to another. But the more complex result surely gives a
more complete picture than would be obtained from a comparison limited to
the common core of arithmetic operations.

Once the content objectives have been determined, the process of test
selection or construction can begin. At this point there is apt to be a
debate about the relative merits of norm-referenced and criterion-referenced
measures. However, these labels should not be the primary consideration.
What is crucial is an analysis of test content in which judgments are made
about the match between the test and the curriculum and about the likely
sensitivity of the test to school differences.

Test publishers rely on fairly similar techniques that depend on 
careful analysis of widely used curriculum materials to define the content 
coverage of their tests. They produce tests with similar names. The
test scores of the most nearly comparable tests of different publishers are
highly correlated (e.g., Bianchini & Loret, 1974). These similarities
conceal potentially important differences in detailed content coverage and
the match of coverage to the curriculum, however. Detailed comparative
analyses such as those reported by Hoepfner (1978) for reading tests
and by Porter, Schmidt, Floden and Freeman (1973) for mathematics
tests reveal surprisingly large differences between tests. Furthermore,
the degree of overlap between what is taught and what is tested has been
found to be closely related to performance (e.g., Bianchini, 1978; Leinhart,
1983).

The importance of overlap between test content and either curriculum
materials or teacher reports of instruction is illustrated by Leinhart's
(1983) summary of the results of two studies. In one of those studies
teachers identified curricula used with each student. A computer list was
compiled of the words in the curriculum materials used for each student. A
list of words on the test was also compiled. These lists were then used to
obtain estimates of overlap for each student. With pretest partialled out, the
correlation between the posttest and overlap was .38. Similar results were
obtained using instruction-based estimates of overlap. Such results
suggest that consideration of overlap may be critical in studies of school 
effectiveness. 

Analysis 

Although the choice of the tests to be used in an evaluation of school 
effectiveness is of crucial importance, no further consideration will be 
given to that issue here. It is a topic that is given considerable atten- 
tion, not only in most tests and measurement textbooks, but in most test 
manuals. This is not to say, of course, that the advice in these sources 
is always heeded. But even if the tests are chosen with care, several
obstacles stand in the way of translating the test results into measures of
school effectiveness. Some of these have been clearly stated by Frechtling
(1983): should status or gain be analyzed, and should averages or
criterion attainment be used? It seems to me, however, that a prior question
is what scores should be used.

A single global score in mathematics may serve some purposes and may
be the only mathematics score with sufficient reliability at the individual
student level. However, the focus in a study of school effectiveness is at
a different level, and global scores may conceal differences that exist for
finer breakdowns of the content. The intermediate level mathematics survey
test of the Metropolitan Achievement Tests (Prescott, Balow, Hogan &
Farr, 1978), for example, consists of 50 items that span 7 content areas.
These are numeration, geometry and measurement, problem solving, whole
number operations, laws and properties of operations, fraction and decimal
operations, and graphs and statistics. The number of items per content
area ranges from 3 to 13. Although the number of items in a single content
strand is too few for reliable individual measurement, separate
scores for the content strands may have utility at the school level. Of
course, more items per strand would yield greater fidelity, and this could
be accomplished by using the Intermediate MAT Mathematics Instructional
Tests, which cover the same 7 content areas with between 18 and 42 items
per area for a total of 204 items. The tradeoff of greater fidelity for
these areas of mathematics is apt to be a narrower bandwidth, i.e., better
measurement in mathematics at the expense of less coverage in other areas.

An alternative approach that can enhance both fidelity and bandwidth
at the school level is to have students respond to different tests. For
example, the number of items in each content strand would be doubled by
administering Form JS of the MAT Survey Battery to half the students and
Form KS to the other half, thereby providing between 6 and 26 items per
content strand for school level analysis while still yielding complete
survey test scores for individual students.
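
The arithmetic behind this split-form design can be sketched in a few lines
of Python. The sketch assumes that the two forms are parallel (same strand
layout, no items in common). The per-strand item counts are hypothetical
values chosen only to match the stated range of 3 to 13 and the total of 50,
and the function names are illustrative.

```python
# Hypothetical items per content strand on one survey form. Only the range
# (3 to 13) and the total (50) come from the text; the split is made up.
ITEMS_PER_STRAND = {
    "numeration": 6,
    "geometry and measurement": 8,
    "problem solving": 13,
    "whole number operations": 9,
    "laws and properties of operations": 3,
    "fraction and decimal operations": 7,
    "graphs and statistics": 4,
}


def assign_forms(student_ids, forms=("JS", "KS")):
    """Alternate forms down the roster so each form reaches about half the students."""
    return {sid: forms[i % len(forms)] for i, sid in enumerate(student_ids)}


def school_level_items(items_per_strand, n_forms=2):
    """Items per strand available at the school level, assuming parallel
    forms with identical strand layouts and no shared items."""
    return {strand: n * n_forms for strand, n in items_per_strand.items()}


assignment = assign_forms(["s1", "s2", "s3", "s4", "s5", "s6"])
pooled = school_level_items(ITEMS_PER_STRAND)

print(assignment)
print(min(pooled.values()), max(pooled.values()))  # 6 26
```

Each student still answers one complete 50-item form, so individual survey
scores are unaffected; only the school-level strand scores draw on both forms.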

One or Two Administrations. Comparing schools on the results of a
single test administration, whether in terms of average scores on a
norm-referenced test or passing rates on a criterion-referenced test, has
serious flaws. As Frechtling (1983, p. 3) noted, one is apt to find that
"the apparently most successful school is that serving the wealthiest
students from the best educated families." Even when trends in test scores,
e.g., the mean scores for a particular grade of a school over several years
(Phi Delta Kappan, 1980), are used, the increases or decreases are apt to
reflect changes in "the socioeconomic composition of a school's student
body" (Rowan, Bossert, & Dwyer, 1983; see also Rowan & Denk, 1982). Hence
simple ranking in terms of status scores or trends in means for a grade
cannot be considered a fair measure of school effectiveness. At a minimum,
comparisons must take into account differences in socio-economic status.

Some test publishers provide special report services that incorporate
adjustments for differences in socio-economic status. The MAT again
provides an illustration of this approach. The SES predicted achievement
report (The Psychological Corporation, 1981) provides a comparison of
obtained mean achievement scores for a school to ranges of scores
predicted from a parental education index. Schools are located in one of five
score bands based on a regression analysis of school means in the MAT
school norms. Nationally, schools would be expected to be distributed
with about 10%, 20%, 40%, 20% and 10% of the schools in bands 1 through 5,
respectively.

The use of an SES indicator to adjust scores is an improvement over
simple ranking on achievement scores, but the adjustment may not be
adequate. In Cronbach's (1982, p. 191) terminology, SES is almost surely
an "incomplete covariate." That is, it provides only a partial adjustment
for preexisting differences outside the control of the school. Finding the
complete covariate is undoubtedly an illusory goal, but prior achievement
probably provides the closest feasible approximation. For this reason,
estimates of school effectiveness are more dependable when pretest results
are used to adjust posttest performance. Gain scores are the simplest,
but not the best, way of making the desired adjustment. A regression
effect may bias the results of an analysis of mean gain scores. While
slightly more complicated, a regression analysis alleviates this problem.

A regression approach to deriving indices of school effectiveness has
been described by Dyer (1966, 1970a, 1970b), and several authors (e.g.,
Dyer, Linn & Patton, 1969; Forsyth, 1973; Marco, 1974; Marco, Murphy &
Quirk, 1976; Rowan & Denk, 1982) have investigated variations and
properties of this general approach. In its simplest form, school posttest
means are regressed on school pretest means, and school performance
indices are based on deviations of observed posttest means from their
predicted values. Other predictors, e.g., means on other pretests or
measures of SES, may also be incorporated in the regression, but are apt
to improve the predictive power of the pretest relatively little.
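
The simplest form of this index can be sketched as follows. This is a minimal
illustration under stated assumptions, not a reconstruction of any cited
author's procedure: a single predictor, ordinary least squares, and made-up
school means. The function names fit_line and effectiveness_indices are
hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def effectiveness_indices(pre_means, post_means):
    """Residuals of school posttest means regressed on school pretest means.
    A positive value means the school scored above its predicted posttest mean."""
    a, b = fit_line(pre_means, post_means)
    return [post - (a + b * pre) for pre, post in zip(pre_means, post_means)]


# Made-up school means, for illustration only.
schools = ["A", "B", "C", "D", "E"]
pre = [40.0, 45.0, 50.0, 55.0, 60.0]
post = [47.0, 49.0, 56.0, 58.0, 65.0]

indices = effectiveness_indices(pre, post)
for name, value in zip(schools, indices):
    print(name, round(value, 2))
```

By construction the residuals sum to zero across the schools in the analysis,
so the index describes standing relative to the other schools analyzed rather
than effectiveness in any absolute sense.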

A dilemma in this approach, which also applies to the use of average
gains, is caused by missing data. Some students will have pretest scores
but no posttest scores, while the converse is true of others. As shown by
Dyer, Linn and Patton (1969), the results for cases with complete data
(both pretest and posttest) may differ from those based on means for all
students with one or both scores. The complete data results may be based
on only a small fraction of the students served by a school where mobility
is high. On the other hand, it seems unreasonable to attribute effects to
a school based on changes in the student body due to mobility. Hence,
the use of only cases with both pretest and posttest scores seems
preferable. Student mobility is a relevant variable to consider in
interpreting those results, however.
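
The difference between the two ways of forming school means can be seen in a
small sketch. The records below are made up, None marks a missing score, and
the function names are hypothetical. The fraction of students with complete
data is returned alongside the complete-case means so that mobility can be
weighed when the results are interpreted.

```python
# (student, pretest, posttest); None marks a missing score. Made-up records.
records = [
    ("s1", 40, 48),
    ("s2", 55, 60),
    ("s3", 35, None),  # left the school before the posttest
    ("s4", None, 70),  # arrived after the pretest
    ("s5", 50, 57),
]


def mean(values):
    return sum(values) / len(values)


def all_available_means(records):
    """Pretest and posttest means over every student with that score."""
    pre = [p for _, p, _ in records if p is not None]
    post = [q for _, _, q in records if q is not None]
    return mean(pre), mean(post)


def complete_case_means(records):
    """Means over students with both scores, plus the fraction of students used."""
    both = [(p, q) for _, p, q in records if p is not None and q is not None]
    coverage = len(both) / len(records)
    return mean([p for p, _ in both]), mean([q for _, q in both]), coverage


pre_all, post_all = all_available_means(records)
pre_cc, post_cc, coverage = complete_case_means(records)

print("all available:", pre_all, post_all, "gain", post_all - pre_all)
print("complete cases:", round(pre_cc, 2), post_cc, "gain", round(post_cc - pre_cc, 2))
print("fraction of students with complete data:", coverage)
```

In this toy example the apparent mean gain is roughly twice as large when all
available scores are used, and the whole difference comes from who entered and
left the school, not from any change in the students who stayed.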

In her discussion of the use of average gain scores, Frechtling (1983)
observed that there seems to be little consistency between school gains
from one year to the next, with correlations between mean gains of only
about .2 or .3. She concluded that "either school effectiveness is a very
fragile thing or the metric used, the gain score, has serious problems."
As was previously stated, gain scores are less adequate than scores based on
a regression approach. However, the evidence suggests that the latter
approach may produce results that are no less fragile. Forsyth (1973)
investigated the stability of school residuals from regressions of school
mean posttest on pretest scores from one year to the next. The median
correlation between residuals for 10 different scores was only .28, with a
range from a low of .11 for Ability to do Quantitative Thinking to a high
of .50 for Social Studies. Similar results have been reported by Jencks
et al. (1972) and by Rowan and Denk (1982). Although these results may
be more fragile than seems desirable, they may also reflect reality, at least
at the level of general composite scores. It may be that somewhat more
stable scores would be obtained by using more specific content scores of the
type illustrated earlier. It also may be that scores for content areas
where there is less uniform agreement on coverage across schools, such as
the computer literacy example used above, would yield more stable results
from year to year than those obtained for the core areas. Furthermore,
even this limited amount of stability may be sufficient for contrasting
extremes, e.g., the 10% of the schools with the largest positive residuals
with the 10% that have the largest negative residuals.
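
A stability check of this kind, together with the selection of extreme groups,
might be sketched as below. The residuals are made-up numbers for ten
hypothetical schools; the 10% cut follows the example in the text, while the
rounding rule for group size is an assumption of this sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def extreme_groups(residuals, fraction=0.10):
    """Return (lowest, highest) school labels, each about `fraction` of schools."""
    k = max(1, round(len(residuals) * fraction))
    ranked = sorted(residuals, key=residuals.get)
    return ranked[:k], ranked[-k:]


# Made-up residuals (observed minus predicted posttest mean) for two years.
year1 = {"S01": -3.0, "S02": -1.5, "S03": -1.0, "S04": -0.5, "S05": 0.0,
         "S06": 0.2, "S07": 0.8, "S08": 1.0, "S09": 1.5, "S10": 2.5}
year2 = {"S01": -1.0, "S02": 1.2, "S03": -2.0, "S04": 0.6, "S05": -0.4,
         "S06": 1.1, "S07": -0.9, "S08": 0.3, "S09": -0.2, "S10": 1.3}

names = sorted(year1)
stability = pearson([year1[s] for s in names], [year2[s] for s in names])
lowest, highest = extreme_groups(year1)

print("year-to-year correlation of residuals:", round(stability, 2))
print("lowest group:", lowest, "highest group:", highest)
```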

Beyond Averages. So far the focus has been only on school means.
This focus is understandable, but it ignores other possible differences
between schools that are of potential interest. Two schools may appear
quite similar in terms of average scores but differ considerably in the
performance of the highest and lowest achieving students. That is, a given
average gain may be obtained as the result of large gains for initially low
scoring students and only modest gains for initially high scoring students.
Conversely, the same average gain may be achieved by large gains at the
upper end of the distribution at the expense of gains at the lower end.

Two approaches have been suggested for going beyond school means.
Dyer, Linn and Patton (1969) used the 20th and 80th within-school
percentiles in addition to school means. Posttest scores at the 80th
percentile, for example, were regressed on pretest scores at the 80th
percentile. A comparable analysis was performed using within-school 20th
percentile scores. Schools that were identified as effective using means were
not necessarily the same as those that were identified using 20th or 80th
percentile points.
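
The percentile-based analysis can be sketched in the same way as the analysis
of means. The sketch below is illustrative only: the student scores are made
up, the percentile is computed by linear interpolation between order
statistics (an assumption, since the text does not say how percentiles were
obtained), and all function names are hypothetical.

```python
def percentile(values, p):
    """p-th percentile (0 to 100) by linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def percentile_indices(schools, p):
    """Regress each school's p-th percentile posttest score on its p-th
    percentile pretest score and return the residual for each school."""
    names = list(schools)
    pre = [percentile(schools[n][0], p) for n in names]
    post = [percentile(schools[n][1], p) for n in names]
    a, b = fit_line(pre, post)
    return {n: y - (a + b * x) for n, x, y in zip(names, pre, post)}


# school -> (student pretest scores, student posttest scores); made-up data.
SCHOOLS = {
    "A": ([30, 40, 50, 60, 70], [38, 47, 55, 66, 78]),
    "B": ([35, 45, 50, 55, 75], [45, 52, 56, 60, 77]),
    "C": ([25, 35, 45, 60, 65], [30, 42, 53, 65, 72]),
    "D": ([40, 50, 55, 65, 80], [50, 57, 60, 68, 82]),
}

low = percentile_indices(SCHOOLS, 20)
high = percentile_indices(SCHOOLS, 80)
for name in SCHOOLS:
    print(name, "20th:", round(low[name], 2), "80th:", round(high[name], 2))
```

Comparing the two columns of residuals shows whether a school's standing
depends on which part of the within-school distribution is examined.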

An alternative approach that takes into account differential effects on
initially high and low scoring students within a school has been proposed
by Burstein, Linn and Capell (1978). In the latter approach, within-school
regressions of student posttest on pretest scores are computed.
The within-school slopes are used along with results based on school means
to describe school performance. Attempts are then made to explain both
sets of results in terms of school process variables.
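
What a within-school slope adds to a school mean can be illustrated with two
hypothetical schools that have identical mean gains. The data and function
names below are made up for the illustration, and the sketch covers only the
descriptive step, not the later attempt to explain results with school process
variables.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def describe_school(pre, post):
    """Mean gain plus the within-school slope of posttest on pretest."""
    _, slope = fit_line(pre, post)
    mean_gain = sum(q - p for p, q in zip(pre, post)) / len(pre)
    return {"mean_gain": mean_gain, "slope": slope}


# school -> (student pretest scores, matched student posttest scores)
SCHOOLS = {
    "X": ([30, 40, 50, 60, 70], [42, 49, 56, 63, 70]),  # low scorers gain most
    "Y": ([30, 40, 50, 60, 70], [30, 43, 56, 69, 82]),  # high scorers gain most
}

summary = {name: describe_school(pre, post) for name, (pre, post) in SCHOOLS.items()}
for name, stats in summary.items():
    print(name, "mean gain:", stats["mean_gain"], "slope:", round(stats["slope"], 2))
```

Both schools show a mean gain of 6 points, but the slope below 1 in school X
and above 1 in school Y indicates that the gains are concentrated at opposite
ends of the pretest distribution.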

The gain in percent passing a prespecified standard on a
criterion-referenced test, which was described by Frechtling (1983) and is the
basis of the analysis reported by Clark and McCarthy (1983), may also
give information not contained in an analysis of average gains. However,
this approach is less satisfactory than either of the two approaches
just described. The choice of a standard is fraught with difficulty and
the percentage increase metric has undesirable properties. It seems
unreasonable, for example, to consider a change from 10 to 20 percent
passing comparable to a change from 85 to 95 percent passing.
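
One way to see the problem is to summarize the two changes in more than one
way. The alternative summaries below (the ratio of passing rates and the
proportional reduction in the failing share) are not proposed in the text;
they are included only to show that two changes that are equal in percentage
points look very different on other reasonable scales. The function name is
hypothetical.

```python
def passing_change(before, after):
    """Summarize a change in percent passing (both values on a 0 to 100 scale)."""
    return {
        "point_change": after - before,
        "ratio": after / before,
        # Net share of the initially failing group that is now passing.
        "failing_share_reduction": (after - before) / (100 - before),
    }


low = passing_change(10, 20)
high = passing_change(85, 95)

for label, result in (("10 to 20", low), ("85 to 95", high)):
    print(
        label,
        "points:", result["point_change"],
        "ratio:", round(result["ratio"], 2),
        "failing share reduced by:", round(result["failing_share_reduction"], 2),
    )
```

Both changes are 10 percentage points, yet the first doubles the passing rate
while removing about a ninth of the failures, and the second raises the
passing rate by about a ninth while removing two thirds of the failures.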

Conclusion

While no panacea, comparisons of observed posttest results to those
predicted from a regression of posttest on pretest seem the soundest
approach. Within-school points in the distribution (e.g., 20th and 80th
percentile) or within-school regressions as well as means are potentially
relevant. The scores should be as content specific as feasible and range
over both common and relatively unique objectives. Even so, the evidence
suggests that only a modest degree of stability can be expected from one
year to the next. Hence, it seems wise to avoid drawing conclusions from
small differences in indices of effectiveness.